Since I want to analyze pickups by month, day, hour, and weekday, I created a new variable for each.
To check for missing values (NAs), I tabulated them across the whole dataset:
##
## FALSE
## 442986
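The two steps above might look roughly like the sketch below in base R. The sample rows are hypothetical, and the timestamp format (`month/day/year hour:minute`) is an assumption based on the `pickup_dt` values shown in the structure output; the code actually used for this report is not shown.

```r
# Hypothetical sample rows; the real data has 26,058 rows.
uber <- data.frame(
  pickup_dt = c("1/1/2015 1:00", "1/1/2015 1:00", "2/14/2015 17:00"),
  borough   = c("Bronx", "Manhattan", "Manhattan"),
  pickups   = c(152L, 5258L, 4345L),
  stringsAsFactors = FALSE
)

# Parse the timestamp (assumed format: month/day/year hour:minute).
# as.character() keeps this working if pickup_dt is stored as a factor.
dt <- as.POSIXct(as.character(uber$pickup_dt),
                 format = "%m/%d/%Y %H:%M", tz = "UTC")

# Derive the new character columns.
uber$month <- format(dt, "%m")
uber$hour  <- format(dt, "%H")
uber$day   <- format(dt, "%d")
uber$wday  <- weekdays(dt)  # weekday names depend on the session locale

# Tabulate missing values across every cell of the data frame.
na_counts <- table(is.na(uber))
print(na_counts)
```

A single `FALSE` count equal to rows times columns, as in the output above (26058 x 17 = 442986), means no cell is missing.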
About: This dataset contains Uber pickup details from 01/01/2015 to 06/30/2015, with weather and holiday variables included. It has approximately 26,000 rows, each giving the number of pickups in one borough during one hour. The purpose of this EDA is to explore the relationship between pickups and other variables such as location, weather, and holidays.
Structure of uber dataset:
## 'data.frame': 26058 obs. of 17 variables:
## $ pickup_dt : Factor w/ 4343 levels "1/1/2015 1:00",..: 1 1 1 1 1 1 12 12 12 12 ...
## $ borough : Factor w/ 6 levels "Bronx","Brooklyn",..: 1 2 3 4 5 6 1 2 3 4 ...
## $ pickups : int 152 1519 0 5258 405 6 120 1229 0 4345 ...
## $ wind_spd : num 5 5 5 5 5 5 3 3 3 3 ...
## $ visib_mile : num 10 10 10 10 10 10 10 10 10 10 ...
## $ temp_F : num 30 30 30 30 30 30 30 30 30 30 ...
## $ dew_point : num 7 7 7 7 7 7 6 6 6 6 ...
## $ sea_level_press: num 1024 1024 1024 1024 1024 ...
## $ liq_precp_1hr : num 0 0 0 0 0 0 0 0 0 0 ...
## $ liq_precp_6hr : num 0 0 0 0 0 0 0 0 0 0 ...
## $ liq_precp_24hr : num 0 0 0 0 0 0 0 0 0 0 ...
## $ snow_depth_in : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hday : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ month : chr "01" "01" "01" "01" ...
## $ hour : chr "01" "01" "01" "01" ...
## $ day : chr "01" "01" "01" "01" ...
## $ wday : chr "Thursday" "Thursday" "Thursday" "Thursday" ...
The Uber dataset has 17 variables, including the newly created variables from above, and about 26,000 rows.
Summary of pickups variable:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 1.0 86.0 547.2 506.0 7883.0
The summary shows that the median is 86 pickups per borough per hour, while the maximum is 7,883 pickups in a single borough.
Let's see which pickup counts are most common.
To see the distribution better and get rid of the long tail, I used a log scale. I see a multimodal distribution, with the highest count at around 900 pickups.
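One detail worth keeping in mind with a log scale is that the data contains zero counts (the minimum is 0), and `log10(0)` is `-Inf`. The sketch below uses hypothetical counts and adds 1 before taking the log. That shift is an assumption for illustration; how the original plot handled zeros is not stated.

```r
# Hypothetical hourly pickup counts, including zeros as in the real data.
pickups <- c(0, 0, 1, 6, 86, 120, 152, 405, 506, 900,
             1229, 1519, 4345, 5258, 7883)

# log10(0) is -Inf, so zeros need handling before a log-scale histogram.
# Assumption: shift every count by 1 so that zeros map to log10(1) = 0.
log_pickups <- log10(pickups + 1)

# Bin the transformed values; plot = FALSE returns counts without drawing.
h <- hist(log_pickups, breaks = seq(0, 4, by = 0.5), plot = FALSE)
print(data.frame(bin_start = head(h$breaks, -1), count = h$counts))
```

On the log scale, each bin covers a constant ratio of pickup counts rather than a constant difference, which is what pulls in the long right tail.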
Now, since I think rain would affect the number of Uber calls, I wanted to see how much it rained.
Checking the summaries of liquid precipitation (1-hour, 6-hour, and 24-hour, in that order):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.003821 0.000000 0.280000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.02607 0.00000 1.24000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09104 0.05167 2.10000
To see which rainfall amounts are most common in one hour, when it rained:
In 6 hours:
In 24 hours:
For snow:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 2.536 3.167 19.000
To see the distribution better, I excluded snow depths of 0, since there were far more observations without snow than with snow. Around 8.0 inches turned out to be the most common snow depth.
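Dropping the zeros and locating the most common depth could be sketched as follows. The snow depths are hypothetical, and the 1-inch bin width is an assumption; the bin width used for the original histogram is not stated.

```r
# Hypothetical snow depths in inches; most observations are 0.
snow_depth_in <- c(0, 0, 0, 0, 0, 0,
                   1.5, 3.2, 8.1, 8.3, 8.4, 8.6, 12.0, 19.0)

# Keep only observations with snow on the ground.
snowy <- snow_depth_in[snow_depth_in > 0]

# Bin into 1-inch intervals (assumed width) and find the most frequent one.
h <- hist(snowy, breaks = seq(0, 20, by = 1), plot = FALSE)
modal_bin <- h$breaks[which.max(h$counts)]
cat("Most common bin starts at", modal_bin, "inches\n")
```

The modal bin depends on the chosen bin width, so "around 8 inches" is best read as approximate.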
And for temperature:
Interestingly, the temperature distribution appears to be bimodal.
The NYC dataset has about 26,000 rows and 17 variables: 13 original variables plus the 4 I derived from the timestamp. The original variables are the date and time, the number of pickups in that hour, the borough in New York, weather information (temperature, wind speed, liquid precipitation over 1, 6, and 24 hours, sea level pressure, dew point, snow depth, and visibility in miles), and whether the day was a holiday.
The main features of interest are the number of pickups, temperature, visibility in miles (to the nearest tenth), month, hour, holiday status, snow depth, and liquid precipitation. These features could affect the number of pickups, so I think it's important to understand how they can change the business.
I thought splitting the timestamp into month and hour would let me investigate the data more deeply, so I added columns for the month and hour.
Most histograms were long-tailed, since these variables describe things that do not happen often. For example, it can rain heavily, but only on occasion. For the pickup histogram, it was expected to see high counts at low pickup numbers, since a high number of pickups is hard to achieve.
The main output feature is pickups. Before investigating relationships with pickups, I also wanted to check how rain and snow affect visibility, just to make sure I was interpreting the dataset properly. Bad weather, such as heavy rain or snow, should come with poor visibility.
As I expected, bad weather came with poor visibility.
Also, as a sanity check, I plotted snow against month to see whether the graph matched my expectations.
I found that it snowed more in February.
What about month vs rain?
It can be seen that there was more liquid precipitation in January.
Now, to see how these variables affect pickups, I first plotted liquid precipitation vs pickups.
According to the median line, there does not seem to be much of a relationship between rain and pickups.
For snow:
Unlike rain, the snow values are comparatively more spread out. It seems that snow does not have much effect on pickup calls.
For more detail, I split the plot by borough:
From the plot above, I could clearly see that snow depth does not affect Uber pickups, since they are uniformly distributed everywhere except Manhattan. For Manhattan, the median line appears unimodal, peaking at around 10 inches of snow. It looks like there are more pickups on days without snow, but that is because there are more days without snow than with it. To show this, I added a median summary line on top of the graph, and you can see that it is roughly flat.
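The idea behind the median summary line can be sketched numerically: group snow depth into bins and take the median pickups in each bin, per borough. The rows and bin edges below are hypothetical, chosen only for illustration, and this is not necessarily how the plotted line was computed.

```r
# Hypothetical rows: snow depth (inches) and hourly pickups for two boroughs.
uber <- data.frame(
  borough       = rep(c("Bronx", "Manhattan"), each = 6),
  snow_depth_in = rep(c(0, 0, 0, 4, 10, 16), times = 2),
  pickups       = c(40, 50, 900, 48, 52, 47,
                    2000, 2400, 7000, 2300, 2900, 2200)
)

# Group snow depth into bins (assumed edges) so each bin gets its own median.
uber$snow_bin <- cut(uber$snow_depth_in,
                     breaks = c(-Inf, 0, 5, 10, 15, Inf),
                     labels = c("0", "0-5", "5-10", "10-15", "15+"))

# Median pickups per snow bin and borough. Unlike a raw scatter, the median
# does not grow just because a bin holds more observations.
med <- tapply(uber$pickups, list(uber$snow_bin, uber$borough), median)
print(med)
```

Bins with no observations come back as `NA`, which is a useful reminder that deep-snow bins may rest on very few rows.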
Which month is most likely to get a pickup?
It seems that as the weather gets warmer, people call for pickups more, and this applies to all the boroughs. In Manhattan, February had more pickups than March. Maybe that is because it snowed most in February: Manhattan is a busy area, so there is a good chance people called an Uber to get to work because of the snow.
Which hour is most likely to get a pickup?
It can be seen that there are more pickups at night. Pickups also rise during the morning, maybe because in some boroughs people take an Uber to work. Let's see which borough gets the most pickups.
As I expected, Manhattan gets the most pickups. I thought there would be more pickups at EWR (Newark Airport), but there are fewer. Drivers there might still earn more per trip, since they are more likely to drive longer distances. It would have been better if the dataset had a price column as well.
Such a low pickup rate at EWR looked odd, so I checked whether the EWR data had a problem, and I found that all the pickups at EWR are 0.
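A check like this might be done by totaling pickups per borough and then testing one borough's rows directly. The rows below are hypothetical and only illustrate the shape of the check.

```r
# Hypothetical rows for three of the six boroughs.
uber <- data.frame(
  borough = c("Bronx", "EWR", "Manhattan", "Bronx", "EWR", "Manhattan"),
  pickups = c(152L, 0L, 5258L, 120L, 0L, 4345L)
)

# Total pickups per borough.
totals <- tapply(uber$pickups, uber$borough, sum)
print(totals)

# Check whether every EWR row reports zero pickups.
ewr <- uber$pickups[uber$borough == "EWR"]
all_zero <- all(ewr == 0)
cat("All EWR pickups are zero:", all_zero, "\n")
```

If the check comes back `TRUE` on the real data, EWR contributes nothing to any pickup comparison and could reasonably be left out of the borough plots.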
Let's see if temperature affects pickups. My guess is that there will be more pickups as it gets colder or hotter.
##
## Pearson's product-moment correlation
##
## data: uber$pickups and uber$temp_F
## t = 10.302, df = 26056, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.05159033 0.07577533
## sample estimates:
## cor
## 0.06369218
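From the median line, it can be seen that there are more pickups at around 60 F, but the pattern is not definite. I tested the relationship between pickups and temperature with Pearson's correlation coefficient. The correlation is statistically significant (p < 2.2e-16) but very weak (about 0.06), so it is not practically meaningful.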
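To see why such a small correlation still yields a tiny p-value, the t statistic in the test output can be recomputed from the reported correlation and degrees of freedom using the standard formula t = r * sqrt(df) / sqrt(1 - r^2). The sketch below only uses the two numbers printed above.

```r
# Values taken from the cor.test output above.
r  <- 0.06369218   # sample Pearson correlation
df <- 26056        # degrees of freedom (n - 2)

# Standard t statistic for testing whether the true correlation is zero.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Share of variance in pickups that a linear fit on temperature would explain.
r_squared <- r^2

cat(sprintf("t = %.3f, r^2 = %.4f\n", t_stat, r_squared))
```

With about 26,000 rows, `sqrt(df)` is large enough to push even a weak correlation far from zero in t units, while `r^2` shows that temperature accounts for well under 1% of the variance in pickups.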
To see if holidays let Uber drivers make more money:
Contrary to what I expected, holidays do not help drivers get more pickups. I checked the mean of each group, but the mean is not very helpful since there are some outliers. In this case the median is a better measure, and it is almost the same on holidays and non-holidays. Finally, I decided to see how visibility, in miles to the nearest tenth, affected pickups.
After plotting, I realized the graph is misleading: there are more days with good visibility, so more pickups pile up there. To fix that, I added a stat summary line showing the median number of pickups, and found that visibility does not affect the number of pickups.
I also wanted to see which weekday has the most pickups, so I created a new variable, wday. Here is the plot:
It can be seen that there are more Uber pickups on weekends, followed by Friday.
After examining holidays vs non-holidays, I thought comparing working days and non-working days would be a good idea.
According to the plot, Uber drivers get slightly more pickups on non-working days than on working days.
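One way to build such a working-day flag is sketched below. The definition used here (a non-working day is a Saturday, a Sunday, or a holiday) is an assumption, since the report does not spell it out, and the column name `workday` and the sample rows are hypothetical.

```r
# Hypothetical rows with the weekday and holiday columns from the dataset.
uber <- data.frame(
  wday    = c("Thursday", "Saturday", "Sunday", "Monday", "Friday"),
  hday    = c("Y", "N", "N", "N", "N"),
  pickups = c(500, 900, 850, 600, 700)
)

# Assumption: a non-working day is a weekend day or a holiday.
non_working <- uber$wday %in% c("Saturday", "Sunday") | uber$hday == "Y"
uber$workday <- ifelse(non_working, "Non-working", "Working")

# Compare the groups by median, which is less sensitive to outliers.
med <- tapply(uber$pickups, uber$workday, median)
print(med)
```

Because holidays and weekends are merged into one group here, a difference between the two groups can be driven by weekends alone.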
There were some strong relationships between pickups and hour, weekday, and borough.
I was also able to find that in 2015, February got the most snow in NYC.
First, just to confirm my thinking about visibility, I plotted weather against visibility. As I expected, visibility is poor in bad weather. I also found that in 2015, New York got more snow in February than in January and March. After that, I was able to see how month and hour affect pickup numbers.
The relationship was clearest in the plot of pickups vs borough. It shows that Manhattan gets significantly more pickups than the other boroughs.
To see how hour affects pickup numbers across months, I first wanted to see how the morning-to-afternoon hours affected them.
From the two plots, I found that certain hours have more pickups than usual. For example, there is a good chance of more pickups at 17:00, during working hours, and at 20:00, at night.
After observing the plots above, I wanted to see how this differs by borough, so I split the plots:
From the plot above, I could clearly see how hours affect pickups in each month only in Manhattan. In the other boroughs, time does not seem to affect pickups much.
To see pickups in each hour across the different boroughs:
From this plot, I could see that all of the areas follow a similar pattern, but with far more pickups in Manhattan.
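The hour-by-borough comparison can also be summarized as a table of medians, which makes the peak hour in each borough explicit. The rows below are hypothetical and cover only three hours and two boroughs.

```r
# Hypothetical rows: two observations per hour for each of two boroughs.
uber <- data.frame(
  borough = rep(c("Brooklyn", "Manhattan"), each = 6),
  hour    = rep(c("08", "17", "20"), times = 4),
  pickups = c(300, 520, 610, 320, 500, 650,
              2100, 3900, 4200, 2300, 4100, 4400)
)

# Median pickups for every hour (rows) and borough (columns).
med <- tapply(uber$pickups, list(uber$hour, uber$borough), median)
print(med)

# Hour with the highest median in each borough.
peak_hour <- rownames(med)[apply(med, 2, which.max)]
names(peak_hour) <- colnames(med)
print(peak_hour)
```

Comparing the columns shows whether boroughs share the same hourly shape even when their overall levels differ by an order of magnitude.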
To see how pickups are distributed across the hours on working and non-working days, I created this plot:
I see many more working-day dots, since there are more working days on the calendar. To get a closer look, I again split by borough:
I was hoping to see more clearly how working and non-working days affect pickups and, as the graph shows, there is a slight bump in pickups on non-working days. On weekends the small bump is clear, unlike holidays, which do not show any difference.
I also wanted to understand why the number of pickups increased over the months. My guess was that it is because the temperature gets warmer, and to check this assumption I plotted this graph:
As it shows, the temperature gets warmer as the months go by, and pickup numbers increase. It would have been more interesting to have a full year and see pickups across all seasons. In the plot of temperature vs pickups, I could not see a clear relationship other than a peak at around 60 F. Maybe there is a certain temperature at which people like to take an Uber.
To see why only Manhattan got more pickups in February, I added snow_depth as a color.
As the plot shows, there was certainly more snow in February, and only Manhattan had more pickups.
Now I think rain might have an impact only in Manhattan. To check, I did the same as for snow.
Unlike snow, rain did not affect pickups in any borough.
From this investigation, I found how the time of day affects pickup numbers. Even across months, the popular hours do not change much.
A surprising fact was that weather does not affect the number of pickups. Only in Manhattan was there a difference in pickups when there was a lot of snow. Pickups differed only slightly between non-working and working days.
Since none of the relationships looked linear, I did not fit a linear regression model.
As shown in the graph above, liquid precipitation clearly does not affect the number of pickups. The median line is flat, which means the median does not change whether it rains a lot or not.
With the plot above, I was trying to find the distribution of pickups by hour across months. Interestingly, I found that in Manhattan the most pickups occur at 17:00 during working hours and at 20:00 at night. For the other boroughs, it was hard to see a pattern.
I tried to see how the total number of pickups compares across locations. It was clear that busier areas get more pickups. To look further, I examined how weather affects the number of pickups in each borough and, interestingly, there were more pickups in Manhattan when there was more snow.
From this EDA, I found that weather and holidays do not play a big role in the number of pickups. Rather, location and time had a bigger impact on the number of Uber pickups. Since it was hard to find the relationships I expected between features, it was hard to explore more deeply.
What went well was my understanding of this dataset. It was great to find that weather and holidays do not play a big role in Uber pickups. The result was a big surprise to me, since I was expecting a strong relationship. I believe that with a different kind of regression, a predictive model could be built from this dataset.